Panoptic segmentation assigns semantic and instance ID labels to every pixel of an image. Since permutations of instance IDs are also valid solutions, the task requires learning a high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on inductive biases of the task. A diffusion model based on analog bits is used to model panoptic masks, with a simple, generic architecture and loss function. By simply adding past predictions as a conditioning signal, our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our generalist approach can perform competitively with state-of-the-art specialist methods in similar settings.
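The analog-bits encoding at the heart of this formulation can be illustrated with a minimal sketch: integer panoptic IDs are expanded into binary bits and mapped to real values in {-1, +1}, which a continuous diffusion model can denoise, and thresholding recovers the discrete mask. This only illustrates the encode/decode step, not the paper's implementation; the bit width and layout are assumed.

```python
import numpy as np

def ids_to_analog_bits(ids: np.ndarray, num_bits: int = 16) -> np.ndarray:
    """Encode integer IDs (H, W) as analog bits (H, W, num_bits) in {-1, +1}."""
    shifts = np.arange(num_bits)
    bits = (ids[..., None] >> shifts) & 1          # binary expansion, LSB first
    return bits.astype(np.float32) * 2.0 - 1.0     # {0,1} -> {-1,+1}

def analog_bits_to_ids(analog: np.ndarray) -> np.ndarray:
    """Invert the encoding by thresholding at zero (robust to diffusion noise)."""
    bits = (analog > 0).astype(np.int64)
    shifts = np.arange(analog.shape[-1])
    return (bits << shifts).sum(axis=-1)

mask = np.random.randint(0, 1000, size=(4, 4))     # toy panoptic mask of IDs
assert np.array_equal(analog_bits_to_ids(ids_to_analog_bits(mask)), mask)
```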
While language tasks are naturally expressed in a single, unified modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work, we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely object detection, instance segmentation, keypoint detection, and image captioning, all of which have diverse types of outputs, e.g., bounding boxes or dense masks. Despite that, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as a task description, and the sequence output adapts to the prompt so it can produce task-specific output. We show that such a model can achieve competitive performance compared to well-established task-specific models.
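As a rough illustration of such a pixel-to-sequence interface, the sketch below quantizes bounding-box coordinates into discrete bins and prefixes a task-prompt token, in the spirit of the approach described above. The vocabulary layout, bin count, and token ids here are assumptions for illustration, not the paper's actual values.

```python
# Hypothetical vocabulary layout: coordinate bins first, then class/prompt tokens.
NUM_BINS = 1000                      # coordinate quantization bins (assumed)
CLASS_OFFSET = NUM_BINS              # class tokens start after coordinate tokens
PROMPT_DETECT = CLASS_OFFSET + 100   # task-prompt token id (assumed)

def box_to_tokens(box, label, image_size):
    """Quantize one box (xmin, ymin, xmax, ymax) into discrete tokens."""
    w, h = image_size
    xmin, ymin, xmax, ymax = box
    coords = [xmin / w, ymin / h, xmax / w, ymax / h]
    tokens = [min(int(c * NUM_BINS), NUM_BINS - 1) for c in coords]
    return tokens + [CLASS_OFFSET + label]

def detection_sequence(boxes, labels, image_size):
    """Prefix a task prompt, then concatenate per-object token groups."""
    seq = [PROMPT_DETECT]
    for box, label in zip(boxes, labels):
        seq += box_to_tokens(box, label, image_size)
    return seq

print(detection_sequence([(10, 20, 200, 240)], [3], image_size=(640, 480)))
# -> [1100, 15, 41, 312, 500, 1003]
```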
Indoor thermal comfort in smart buildings has a significant impact on occupant health and performance. Consequently, machine learning (ML) is increasingly being used to address challenges associated with indoor thermal comfort. The temporal variability of thermal comfort perception is an important problem that regulates occupant well-being and energy consumption. However, most ML-based thermal comfort studies do not consider temporal aspects such as the time of day, circadian rhythm, and outdoor temperature. This work addresses these problems. It investigates the impact of circadian rhythm and outdoor temperature on the prediction accuracy and classification performance of ML models. The data is gathered through month-long field experiments carried out in 14 classrooms with 512 primary school students. Four thermal comfort metrics are considered as the outputs of deep neural network and support vector machine models for the dataset. The effect of temporal variability on school children's comfort is shown through a "time of day" analysis. Temporal variability in prediction accuracy is demonstrated (by up to 80%). Furthermore, we show that outdoor temperature (which varies with time) positively impacts the prediction performance of thermal comfort models by up to 30%. The importance of spatio-temporal context is demonstrated by contrasting micro-level (location-specific) and macro-level (6 locations across a city) performance. The most important finding of this work is that a clear improvement in prediction accuracy is shown for multiple thermal comfort metrics as time of day and sky illuminance increase.
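The abstract does not specify how temporal context is fed to the models; a common choice, shown below as a hypothetical sketch, is a cyclical sine/cosine encoding of the time of day appended to the sensor features, so that 23:00 and midnight end up close in feature space.

```python
import numpy as np

def cyclical_time_features(hour_of_day: np.ndarray) -> np.ndarray:
    """Encode hour in [0, 24) as (sin, cos) so 23:00 and 00:00 are neighbors."""
    angle = 2.0 * np.pi * hour_of_day / 24.0
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

# Toy feature matrix: indoor temperature, outdoor temperature, plus time encoding.
hours = np.array([8.0, 13.5, 23.0])
indoor_t = np.array([24.1, 27.3, 22.8])
outdoor_t = np.array([18.0, 31.2, 15.5])
X = np.column_stack([indoor_t, outdoor_t, cyclical_time_features(hours)])
print(X.shape)  # (3, 4) -- ready for an SVM or neural-network classifier
```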
Thermal comfort in indoor environments has an enormous impact on the health, well-being, and performance of occupants. Given the focus on energy efficiency and the realization of smart buildings, machine learning (ML) is increasingly being used for data-driven thermal comfort (TC) prediction. Typically, ML-based solutions are proposed for air-conditioned or HVAC-ventilated buildings, and these models are primarily designed for adults. On the other hand, naturally ventilated (NV) buildings are the norm in most countries. They are also ideal for energy conservation and long-term sustainability goals. However, the indoor environments of NV buildings lack thermal regulation and vary significantly across spatial contexts. These factors make TC prediction extremely challenging. Thus, it is important to determine the impact of the built environment on the performance of TC models. Further, the generalization capability of TC prediction models across different NV indoor spaces needs to be studied. This work addresses these problems. Data is gathered through month-long field experiments conducted in 5 naturally ventilated school buildings, involving 512 primary school students. The impact of spatial variability on student comfort is demonstrated through variation in prediction accuracy (by up to 71%). The effect of the built environment on TC prediction is also demonstrated through variation in feature importance. Further, a comparative analysis of spatial variability in model performance is done for children (our dataset) and adults (the ASHRAE-II database). Finally, the generalization capability of thermal comfort models in NV classrooms is assessed, and major challenges are highlighted.
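One standard way to quantify the cross-space generalization studied here is leave-one-building-out evaluation: train on all buildings but one and test on the held-out one, repeating for each building. The sketch below illustrates this protocol with toy data and a random forest; the actual models and features used in the study differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))            # toy sensor/survey features
y = rng.integers(0, 3, size=500)         # toy 3-point comfort labels
building = rng.integers(0, 5, size=500)  # which of 5 school buildings

# Train on 4 buildings, test on the held-out one; repeat for each building.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=building):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    acc = model.score(X[test_idx], y[test_idx])
    print(f"held-out building {building[test_idx][0]}: accuracy={acc:.2f}")
```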
This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding images to the contextual use of words in language. The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles. The dataset is a valuable resource because (i) the images are aligned to text fragments rather than whole sentences; (ii) multiple images are available for a text fragment as well as for a sentence; (iii) the sentences are free-form and real-world; and (iv) the parallel texts are multilingual. We set up a fill-in-the-blank game for humans to evaluate the quality of the dataset's automatic image selection process. We show the utility of the dataset on two automatic tasks: (i) fill-in-the-blank and (ii) lexical translation. Results from the human evaluation and the automatic models show that images can be a useful complement to textual context. The dataset will benefit research on visual grounding of words, especially in the context of free-form sentences, and is available from https://doi.org/10.5281/zenodo.5034604 under a Creative Commons license.
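To make the fill-in-the-blank task concrete, a hypothetical sketch: given a sentence template with a blank and a set of candidate words, each completed sentence is scored against the paired image and the best-scoring candidate wins. The scorer below is a toy stand-in; a real system would use a learned image-text similarity model.

```python
def image_text_score(image_id: str, sentence: str) -> float:
    # Toy stand-in: a real system would use a learned image-text similarity model.
    return float(image_id.split("_")[0] in sentence)

def fill_in_the_blank(template: str, candidates: list[str], image_id: str) -> str:
    """Choose the candidate whose completed sentence best matches the image."""
    completed = {c: template.replace("___", c) for c in candidates}
    return max(candidates, key=lambda c: image_text_score(image_id, completed[c]))

print(fill_in_the_blank("She poured the ___ into a glass.",
                        ["sand", "wine"], image_id="wine_00017"))  # -> "wine"
```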
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point-cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
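A minimal sketch of the implicit-alignment idea: embed shared 3D coordinates with a small MLP and add the resulting position encodings to both image and point-cloud tokens before a transformer decoder, so object queries attend over one fused token set. The module sizes, token counts, and query construction below are assumptions, not CMT's actual design.

```python
import torch
import torch.nn as nn

class CoordPosEncoder(nn.Module):
    """Map 3D point coordinates to token-space position embeddings (a sketch)."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        return self.mlp(xyz)

d = 256
pos_enc = CoordPosEncoder(d)
img_tokens = torch.randn(1, 100, d)  # toy image tokens
pts_tokens = torch.randn(1, 200, d)  # toy point-cloud tokens
img_xyz = torch.rand(1, 100, 3)      # toy 3D points associated with image tokens
pts_xyz = torch.rand(1, 200, 3)

# Adding embeddings of shared 3D coordinates aligns the two modalities implicitly.
tokens = torch.cat([img_tokens + pos_enc(img_xyz),
                    pts_tokens + pos_enc(pts_xyz)], dim=1)
queries = torch.randn(1, 900, d)     # toy object queries
decoder = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
out = decoder(queries, tokens)       # queries attend over the fused tokens
print(out.shape)                     # torch.Size([1, 900, 256])
```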
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
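The key mechanical difference from prior backdoor attacks can be sketched as follows: the trigger is (re-)applied to the synthetic images inside the distillation loop itself, rather than to the training data of a downstream model. The distillation objective below is a toy stand-in, and unlike DOORPING the trigger here is fixed rather than iteratively optimized.

```python
import torch

def stamp_trigger(images: torch.Tensor, trigger: torch.Tensor) -> torch.Tensor:
    """Overwrite the bottom-right corner of each image with the trigger patch."""
    out = images.clone()
    ph, pw = trigger.shape[-2:]
    out[:, :, -ph:, -pw:] = trigger
    return out

# Toy stand-in for a distillation objective (real methods match gradients or
# feature statistics between synthetic and real data).
target_stats = torch.zeros(3)
def toy_distill_loss(images: torch.Tensor) -> torch.Tensor:
    return ((images.mean(dim=(0, 2, 3)) - target_stats) ** 2).sum()

synthetic = torch.randn(10, 3, 32, 32, requires_grad=True)  # distilled images
trigger = torch.ones(3, 4, 4)                               # assumed 4x4 patch
opt = torch.optim.SGD([synthetic], lr=0.1)

for step in range(100):
    opt.zero_grad()
    toy_distill_loss(synthetic).backward()
    opt.step()
    # Key idea: re-stamp the trigger on some synthetic images after every
    # update, so the backdoor survives the distillation procedure instead of
    # being optimized away.
    with torch.no_grad():
        synthetic[:2] = stamp_trigger(synthetic[:2], trigger)
```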
Automatic music generation with artificial intelligence typically requires a large amount of data, which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while a model that is not pre-trained (Transformer) shows no such ability beyond naive repetition. Evaluating generated music is a challenging task, and evaluating drum grooves, for which there is little precedent in the literature, even more so. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
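The abstract does not spell out its MIDI-to-text encoding, so the following is a purely hypothetical sketch of a language-to-music interface: drum hits are placed on a sixteenth-note grid and rendered as a string of step tokens that a text-pre-trained model can be finetuned on.

```python
# Toy drum groove: (sixteenth-note step, instrument) pairs over one bar.
groove = [(0, "kick"), (0, "hat"), (2, "hat"), (4, "snare"), (4, "hat"),
          (6, "hat"), (8, "kick"), (8, "hat"), (10, "hat"), (12, "snare")]

def groove_to_text(events, steps: int = 16) -> str:
    """Render a drum pattern as one text token per grid step, e.g. 'kick+hat'."""
    grid = {s: [] for s in range(steps)}
    for step, drum in events:
        grid[step].append(drum)
    return " ".join("+".join(grid[s]) if grid[s] else "rest" for s in range(steps))

prompt = groove_to_text(groove)
print(prompt)
# A language model finetuned on such strings can then be sampled to continue
# the pattern, and the generated text decoded back into MIDI events.
```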
Few-Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes from only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support and query features based on a Transformer-like framework. Our key insights are twofold: first, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice, from two aspects, i.e., feature-level and instance-level. In particular, we first design a mask-based dynamic weighting module to enhance support features, and then propose to link object queries for better calibration via cross-attention. After the above steps, performance on the novel classes improves significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking on the COCO dataset in the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shots; e.g., we boost nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few-Shot Object Detection. Code and models will be available.
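The feature-level half of the enhancement can be sketched as masked average pooling of the support features into a class center, which then re-weights the query features channel-wise. The sigmoid gating below is an assumed instantiation for illustration, not the paper's exact module.

```python
import torch

def masked_class_center(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average support features (C, H, W) inside the object mask (H, W)."""
    mask = mask.float()
    return (feat * mask).sum(dim=(1, 2)) / mask.sum().clamp(min=1.0)

C, H, W = 64, 32, 32
support_feat = torch.randn(C, H, W)
support_mask = torch.rand(H, W) > 0.7   # toy binary support mask
query_feat = torch.randn(C, H, W)

# Feature-level enhancement: use the class center as a channel-wise gate.
center = masked_class_center(support_feat, support_mask)  # (C,)
gate = torch.sigmoid(center)                              # assumed gating choice
reweighted = query_feat * gate[:, None, None]
print(reweighted.shape)  # torch.Size([64, 32, 32])
```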
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs have a large number of parameters, which makes them computationally expensive. Therefore, it is difficult to deploy them onto edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution to compress GNNs, where a lightweight model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods lack fairness consideration. As a consequence, the student model usually inherits and even exaggerates the bias of the teacher GNN. To handle such a problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. Then we propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborate that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
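The abstract does not describe RELIANT's internal mechanism, so the sketch below only illustrates the problem setting it targets: a standard soft-label distillation loss combined with a generic fairness regularizer (here, a demographic-parity gap over a toy sensitive attribute); the combination weight is assumed.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T: float = 2.0):
    """Standard soft-label knowledge distillation loss."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T

def parity_gap(student_logits, sensitive):
    """Gap in mean positive-class probability between the two groups
    (assumes both groups are non-empty)."""
    p = F.softmax(student_logits, dim=-1)[:, 1]
    return (p[sensitive == 0].mean() - p[sensitive == 1].mean()).abs()

student_logits = torch.randn(128, 2, requires_grad=True)  # toy node predictions
teacher_logits = torch.randn(128, 2)                      # toy teacher outputs
sensitive = torch.randint(0, 2, (128,))                   # toy sensitive attribute

lam = 0.5  # assumed utility/fairness trade-off weight
loss = kd_loss(student_logits, teacher_logits) \
       + lam * parity_gap(student_logits, sensitive)
loss.backward()
```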